Model-Free Control

In model-free control, we do not have a model of the environment: the model may be unknown, incomplete, or too expensive to compute. Instead, we learn the policy or the action-value function directly from experience.

Prediction is the process of estimating the value function, while control is the process of finding the optimal policy. The objective function for control is the action-value function:

$$Q(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$$

On-Policy Control: On-policy control methods learn about the same policy that they follow to generate experience. The policy is updated based on the action-value function.

Off-Policy Control: Off-policy control methods learn a target policy that is different from the policy they follow to generate experience. The target policy is updated based on the action-value function, while the behavior policy is the one actually followed in the environment.

Policy evaluation is the process of estimating the value function for a given policy. Policy improvement is the process of finding a better policy based on the value function.

SARSA

SARSA is an on-policy control method. It is a model-free method that learns the action-value function.

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$$

At every time step, policy evaluation is performed with the SARSA update, and policy improvement is performed with the $\epsilon$-greedy policy.
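A minimal tabular sketch of one SARSA step follows. The state/action space sizes, the learning rate, and the sample transition are illustrative assumptions, not values from the notes:

```python
import numpy as np

def epsilon_greedy(Q, s, eps, rng):
    # Behave epsilon-greedily with respect to the current Q estimates.
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # Q(S_t,A_t) <- Q(S_t,A_t) + alpha [R_{t+1} + gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)]
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

# One hypothetical transition: (s=0, a=1) -> reward 1.0 -> (s'=2, a'=0)
Q = np.zeros((3, 2))
sarsa_update(Q, 0, 1, 1.0, 2, 0, alpha=0.5, gamma=0.9)
print(Q[0, 1])  # 0.5 * (1.0 + 0.9 * 0.0 - 0.0) = 0.5
```

Note that the update bootstraps from `Q[s_next, a_next]`, where `a_next` is itself chosen by the $\epsilon$-greedy behavior policy; this is what makes SARSA on-policy.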

n-step SARSA: The update uses the n-step return, which accumulates n rewards before bootstrapping from the action-value function.

$$q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})$$
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ q_t^{(n)} - Q(S_t, A_t) \right]$$
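The n-step return can be computed directly from a list of observed rewards plus a bootstrap value. The helper name and the sample numbers below are illustrative assumptions:

```python
def n_step_return(rewards, gamma, bootstrap_q):
    # q_t^{(n)} = R_{t+1} + gamma R_{t+2} + ... + gamma^{n-1} R_{t+n}
    #             + gamma^n Q(S_{t+n}, A_{t+n})
    g = sum(gamma ** k * r for k, r in enumerate(rewards))
    return g + gamma ** len(rewards) * bootstrap_q

# n = 2: rewards R_{t+1} = 1, R_{t+2} = 1, bootstrap Q(S_{t+2}, A_{t+2}) = 4
print(n_step_return([1.0, 1.0], 0.5, 4.0))  # 1 + 0.5*1 + 0.25*4 = 2.5
```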

Forward View SARSA($\lambda$): The $\lambda$-return combines all n-step returns, weighted geometrically by $\lambda$.

$$q_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} q_t^{(n)}$$
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ q_t^{\lambda} - Q(S_t, A_t) \right]$$
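A sketch of the $\lambda$-return from a finite list of n-step returns. Since an episode provides only finitely many n-step returns, this uses the common truncated form, where the remaining geometric weight goes to the last available return (an assumption; the notes state only the infinite sum):

```python
def lambda_return(q_n, lam):
    # q_t^lambda = (1 - lambda) * sum_{n>=1} lambda^{n-1} q_t^{(n)}
    # Truncated episodic form: the leftover tail weight lambda^{N-1}
    # is assigned to the last available n-step return.
    head = (1 - lam) * sum(lam ** n * q for n, q in enumerate(q_n[:-1]))
    return head + lam ** (len(q_n) - 1) * q_n[-1]

# lambda = 0 recovers the one-step SARSA target; lambda = 1 the full return.
print(lambda_return([1.0, 2.0, 3.0], 0.0))  # 1.0
print(lambda_return([1.0, 2.0, 3.0], 1.0))  # 3.0
```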

Q-Learning

Q-learning is an off-policy control method. It is a model-free method that learns the action-value function.

The next action is selected using the behavior policy, while the TD target bootstraps from the target policy.

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$

Both the behavior and target policies are improved: the target policy with greedy policy improvement, and the behavior policy with $\epsilon$-greedy policy improvement.
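A minimal sketch of one Q-learning step; the array shapes, step size, and sample transition are illustrative assumptions:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # Q(S_t,A_t) <- Q(S_t,A_t) + alpha [R_{t+1} + gamma max_a' Q(S_{t+1},a') - Q(S_t,A_t)]
    # The max over next actions implements the greedy target policy; the action
    # actually executed can come from an epsilon-greedy behavior policy.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

# Hypothetical transition into a state that already has value estimates.
Q = np.zeros((3, 2))
Q[2] = [0.0, 2.0]                     # max_a Q(S_{t+1}, a) = 2.0
q_learning_update(Q, 0, 1, 1.0, 2, alpha=0.5, gamma=0.9)
print(Q[0, 1])  # 0.5 * (1.0 + 0.9 * 2.0 - 0.0) = 1.4
```

Compared with the SARSA target, the only change is replacing $Q(S_{t+1}, A_{t+1})$ with $\max_a Q(S_{t+1}, a)$, which is what makes Q-learning off-policy.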


#MMI706 - Reinforcement Learning at METU